Fix slowdown after generating backslash escape sequences #91

AriX · 2024-04-19T16:46:57Z

Addresses an issue I've seen which I believe is the same as what's reported in #90: an edge case where JSONSchemaParser's shortcut_key implementation did not consider the correct parser while finishing parsing an escape sequence. When the UnionParser that StringParsingState pops onto the stack after getting a BACKSLASH is finished but not yet removed from the stack, shortcut_key fails to return json_freetext.

noamgat · 2024-04-19T17:48:21Z

OK, now I think I understand the problem you reported and the solution you proposed. However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results.
I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem.
If you want, you can implement it that way. If not, I will try to get to it.

AriX · 2024-04-19T17:54:53Z

I think the safe thing to do, which will also fix the performance problem, is to check (iteratively) if the topmost parser is can_end=True and get_allowed_characters() is empty - because that means that it is effectively done and will be ignored.

Good point! I just implemented this.

Better yet, this check should happen not inside the shortcut key, but during the jsonschemaparser itself - if a member in the stack is in can_end=True and get_allowed_characters()=empty, it shouldn't be there anymore because we know what will happen next time. I think this is the correct solution for this problem.

That sounds very reasonable but I'm not sure exactly where in the lifecycle of JSONSchemaParser you had mind for this check to happen.

For now, I implemented the former change. If you want to give more specific guidance I'm happy to try the latter.

AriX · 2024-04-19T17:58:38Z

However I think your specific solution may create illegal outputs, as if the topmost parser still has potential characters that can be inserted, we are not necessarily returning the correct results.

On 2nd thought, I'm not totally sure this is true. If the topmost parser's "can end" is true, then the JSONSchemaParser's get_allowed_characters() already includes the allowed characters of the next parser, without regard for whether or not the topmost parser has allowed characters left. It seems correct for the logic in shortcut_key() should match the logic in get_allowed_characters().

Let me know if I'm missing something.

noamgat · 2024-04-19T18:04:34Z

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack.
If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

I've implemented it in this branch:
https://github.com/noamgat/lm-format-enforcer/tree/feature/escaping_performance

can you check if it solves the performance issue you are encountering?

AriX · 2024-04-19T18:09:08Z

get_allowed_characters() not of the JSONSchemaParser, but of a specific object in its stack. If the specific parser is "finished but not removed", it will have can_end=True and get_allowed_character=''

Yeah, I got that. I guess I understand you original point now, in that if the topmost parser has some characters that are allowed but wouldn't be allowed by json_freetext, we might prevent those from being generated.

can you check if it solves the performance issue you are encountering?

Yes, it will take me a few minutes as I don't use lm-format-enforcer directly but maintain a port of it on the side

noamgat · 2024-04-19T19:57:24Z

You can also see the diff here: d5dcdc5 and apply it locally

AriX · 2024-04-19T22:23:53Z

You can also see the diff here: d5dcdc5 and apply it locally

My port is in Swift so applying the patches takes some work :)
May open-source it some day.

You can also see the diff here: d5dcdc5 and apply it locally

This change doesn't work for me; in my tests it strips StringParserState off of the stack too early, preventing its parsedString from being read.

I can't reproduce this in the Python implementation. I suspect the reason is because my version has a stricter whitespace policy, which means after reading a string my string parser has 0 allowed characters where this implementation has whitespace in its allowed characters.

So, for my use case I think I need to go with the version that alters the shortcut_key() implementation to check for can_end() and empty get_allowed_characters().

But, of course I defer to you if you'd prefer to use an implementation like what you have in feature/escaping_performance here.

noamgat · 2024-04-20T06:00:13Z

You're right, added handling of the StringParser in this commit: 34032f6
Unit tests now pass

noamgat · 2024-04-20T06:09:54Z

I merged the holistic solution together with the string parser fix to master. Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find. I will be closing this PR soon.

AriX · 2024-04-20T06:31:10Z

Thank you so much for spotting this issue, the unit test suite now runs twice as fast in the CI! This is a great find.

That's great - glad I could help!

Fix slowdown after generating backslash escape sequences

15b0121

PR feedback

ac331fb

noamgat mentioned this pull request Apr 20, 2024

Feature/escaping performance #93

Merged

AriX closed this Apr 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix slowdown after generating backslash escape sequences #91

Fix slowdown after generating backslash escape sequences #91

AriX commented Apr 19, 2024 •

edited

Loading

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 •

edited

Loading

AriX commented Apr 19, 2024

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 •

edited

Loading

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 •

edited

Loading

noamgat commented Apr 20, 2024

noamgat commented Apr 20, 2024

AriX commented Apr 20, 2024 •

edited

Loading

Fix slowdown after generating backslash escape sequences #91

Fix slowdown after generating backslash escape sequences #91

Conversation

AriX commented Apr 19, 2024 • edited Loading

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 • edited Loading

AriX commented Apr 19, 2024

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 • edited Loading

noamgat commented Apr 19, 2024

AriX commented Apr 19, 2024 • edited Loading

noamgat commented Apr 20, 2024

noamgat commented Apr 20, 2024

AriX commented Apr 20, 2024 • edited Loading

AriX commented Apr 19, 2024 •

edited

Loading

AriX commented Apr 19, 2024 •

edited

Loading

AriX commented Apr 19, 2024 •

edited

Loading

AriX commented Apr 19, 2024 •

edited

Loading

AriX commented Apr 20, 2024 •

edited

Loading